DOCUMENT RESUME

ED 179 206                                          IR 007 868

AUTHOR          Conti, Dennis M.
TITLE           Computer Science and Technology: Findings of the
                Standard Benchmark Library Study Group, Final
                Report.
INSTITUTION     National Bureau of Standards (DOC), Washington,
                D.C.
REPORT NO       NBS-SP-500-38
PUB DATE        Jan 79
NOTE            57p.
AVAILABLE FROM  Superintendent of Documents, U.S. Government
                Printing Office, Washington, DC 20402 (Stock No.
                003-003-02005-5, $2.40)

EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     *Computer Programs; *Computer Science; *Cost
                Effectiveness; Federal Government; *Information
                Systems; *On Line Systems; Private Agencies; Program
                Evaluation


ABSTRACT

This report presents the findings of a joint government/industry
study group which investigated the technical feasibility of standard
benchmark programs for testing vendor systems in the competitive
selection of computer systems within both private industry and the
federal government. As part of its investigation, the study group
reviewed earlier efforts to develop and use such programs on the part
of the Department of Defense, the Auerbach Corporation, H. Lucas, the
Mitre Corporation, and the Department of Agriculture (USDA). Several
issues dealing with the implementation, maintenance, cost-benefit,
and acceptability of standard benchmarks emerged as a result of this
review. The problems encountered by the study group, notably the lack
of an accepted definition of "representativeness," prevented it from
arriving at a definitive statement of feasibility. However, several
areas that were identified as topics requiring further investigation
are presented in this report. A list of references, a glossary of
terms, the USDA mapping procedure, and sample evaluation criteria are
appended. (Author/FM)



Reproductions supplied by EDRS are the best that can be made
from the original document.



COMPUTER SCIENCE & TECHNOLOGY:

Findings of the Standard Benchmark Library
Study Group

Dennis Conti

Institute for Computer Sciences and Technology
National Bureau of Standards
Washington, D.C. 20234



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL
INSTITUTE OF EDUCATION POSITION OR POLICY.



U.S. DEPARTMENT OF COMMERCE, Juanita M. Kreps, Secretary

Jordan J. Baruch, Assistant Secretary for Science and Technology
NATIONAL BUREAU OF STANDARDS, Ernest Ambler, Director

Issued January 1979




Reports on Computer Science and Technology

The National Bureau of Standards has a special responsibility within
the Federal Government for computer science and technology
activities. The programs of the NBS Institute for Computer Sciences
and Technology are designed to provide ADP standards, guidelines, and
technical advisory services to improve the effectiveness of computer
utilization in the Federal sector, and to perform appropriate
research and development efforts as foundation for such activities
and programs. This publication series will report these NBS efforts
to the Federal computer community as well as to interested
specialists in the academic and private sectors. Those wishing to
receive notices of publications in this series should complete and
return the form at the end of this publication.

National Bureau of Standards Special Publication 500-38

CODEN: XNBSAV



Library of Congress Cataloging in Publication Data

Findings of the standard benchmark library study group.

(Computer science & technology) (NBS special publication; 500-38)

Supt. of Docs. no.: C 13.10:500-38

1. Electronic digital computers--Evaluation. I. Title. II. Series.
III. Series: United States. National Bureau of Standards. Special
publication; 500-38.

Q< l(K).llVno MK>-^K[QA76<> l'<>4)«)2* ls[(K)l 6M]7«-ft()M68 




U.S. GOVERNMENT PRINTING OFFICE
WASHINGTON: 1979

For sale by the Superintendent of Documents, U.S. Government Printing Office, Washington, D.C. 20402

Stock No. 003-003-02005-5 Price $2.40

(Add 25 percent additional for other than U.S. mailing).



ACKNOWLEDGMENTS

The author wishes to thank Mr. Terry Potter of
the Digital Equipment Corporation (formerly with
Bell Telephone Laboratories), and Mr. Norris Goff
of the U.S. Department of Agriculture for their
participation as study group members and as
contributors to this report.







TABLE OF CONTENTS

1. Introduction
   1.1 Background
   1.2 Perspective

2. Previous Efforts
   2.1 Department of Defense Efforts
   2.2 Auerbach Standard Benchmarks
   2.3 Lucas Modules
   2.4 Mitre Study
   2.5 Department of Agriculture Experience

3. The Benchmark Library Study Group
   3.1 Implementation Issues
   3.2 Maintenance Issues
   3.3 Cost/Benefit Evaluation
   3.4 Acceptability to Agencies and Vendors

4. Problems Encountered in Attempting to Determine
   Technical Feasibility

5. Conclusions

References

Appendices
   Appendix A: Glossary of Terms
   Appendix B: USDA Workload Mapping Procedure
   Appendix C: Sample Evaluation Criteria



FINDINGS OF THE STANDARD BENCHMARK
LIBRARY STUDY GROUP

by

Dennis Conti

ABSTRACT

This report presents the findings of a
Government-industry study group investigating the
technical feasibility of standard benchmark
programs. As part of its investigation, the study
group reviewed earlier efforts to develop and use
standard benchmark programs. Several issues
dealing with the implementation, maintenance,
cost/benefit and acceptability of standard
benchmarks emerged as a result of this review.
The problems encountered by the study group,
notably the lack of an accepted definition of
"representativeness," prevented it from arriving
at a definitive statement on feasibility.
However, several areas were identified as topics
requiring further investigation and are presented
in this report.

Key words: Benchmarking; benchmark library;
selection of ADP systems; standard benchmarks;
synthetic benchmarks; workload characterization;
workload definition.



1. Introduction

Benchmarking is an accepted mechanism for testing
vendor systems in the competitive selection of computer
systems within both private industry and the Federal
Government. However, due to the rising cost of benchmarking
on the part of both agencies and vendors, new methods need
to be explored that will help reduce the overall costs of
benchmarking. For this reason, the concept of "standard"
benchmark programs has received renewed interest. A
collection (or "library") of such programs could serve as a
source from which agencies would select parameterized,
functional synthetic programs to supplement their normal
benchmark mix. In this context, a "functional synthetic
program" is a computer program which is written to perform
some pre-defined ADP function. Several important questions
remain, however, related to the feasibility of such an
approach.

A Government-industry study group was formed in 1976 to
determine the technical feasibility of the standard
benchmark library concept. This report first surveys past
efforts to develop and use standard benchmarks, and then
summarizes the problems encountered by the study group. The
report ends with a set of conclusions and suggestions for
future work.



1.1. Background

Government-wide concern for benchmarking-related
problems has been evident since at least 1969 when it was a
major topic at the Conference on the Selection and
Procurement of Computer Systems by the Federal Government,
sponsored by the Office of Management and Budget.

In December 1972 the Commission on Government
Procurement issued the following recommendation
(Recommendation D-14) to the Executive Branch [14]:

    "Develop and issue a set of standard programs to
    be used as benchmarks for evaluating vendor ADPE
    (automatic data processing equipment) proposals."

In response to this recommendation, the General Services
Administration initiated and chaired a committee of
Executive Branch agencies which included the National Bureau
of Standards (NBS), the Department of Defense, the Veterans
Administration, the National Aeronautics and Space
Administration, and the (then) Atomic Energy Commission.
The committee developed an Executive Branch position paper
dated March 27, 1974 [3] which stated that:

    "The feasibility of developing and issuing a set
    of standard programs to be used as benchmarks
    throughout the Federal Government for evaluating
    vendor ADPE proposals has not yet been
    established. If it is determined that these
    benchmarks are feasible, it is the recommendation
    of this committee that the recommendation be
    adopted by the Executive Branch as stated by the
    Commission on Government Procurement."

The Executive Branch position paper added that:

    "The primary objective of Recommendation D-14 is
    perceived to provide a mechanism to reduce the
    costs incurred by both the user and computer
    vendor in the benchmark process."

It also stated that:

    "...much preliminary work needs to be done to test
    the feasibility of various approaches to standard
    benchmarks."

The position paper also pointed out that "criteria had not
yet been established for determining feasibility" and that
such criteria should be established "at an early date."

In May 1976, the Office of Management and Budget gave
notice in the Federal Register of acceptance of
Recommendation D-14 on behalf of the Executive Branch, and
assigned lead agency responsibility to NBS as part of its
existing central management role and ongoing efforts in
benchmarking. NBS was directed to "coordinate and seek
advancements in benchmarking within the executive branch"
and to "publish various guidelines and documents, as
appropriate."

Shortly before this time, NBS began a cooperative study
effort with participation from the U.S. Department of
Agriculture and Bell Laboratories to examine the technical
feasibility of the development and use of functional
synthetic programs as a basis for a common-use ("standard
benchmark") library, one of several possible approaches
responsive to Recommendation D-14. All three of these
organizations had extensive experience in the development
and use of synthetic benchmark programs.



1.2. Perspective

The technique of benchmarking remains a necessary and
important tool in the competitive evaluation and selection
of computer systems within both private industry and the
Federal Government. This is true for several reasons. It
is acceptable to the computer industry as a fair and
unbiased live test of a vendor's proposed system. It is a
mechanism by which an agency can model its current and
projected workloads in such a way as to ensure that the
vendor's proposed system will be of an appropriate size. It
is a test mechanism which is repeatable within acceptable
limits from one vendor to the next. Finally, for most batch
benchmarks, the benchmark can be run against the newly
installed system as part of an agency's acceptance testing
procedures.

Benchmarking as currently practiced within the Federal
Government usually consists of five distinct phases. During
Phase 1, the workload to be performed by the new system is
defined. This usually requires an analysis of the current
workload, a prediction of its future growth, and an estimate
of new applications. In Phase 2, a benchmark is constructed
to represent the defined workload, often in terms of some
critical period of the workload (e.g., a peak month).
During Phase 3, the benchmark is tested, sometimes by
running it on a system other than the agency's current one.
The benchmark is then modified to eliminate any errors or
major machine dependencies, and is suitably documented for
vendor use. In Phase 4, each competing vendor makes
necessary and allowable changes to the benchmark in order
for it to run on his system. Each vendor also undertakes to
configure a system capable of processing the benchmark
within some agency-determined time constraints. Finally, in
Phase 5, the benchmark is run as part of a timed live test
demonstration, and its performance is compared against the
agency-defined constraints. During each of these phases, a
cost is incurred either by the agency (Phases 1, 2, 3), by
the vendor (Phase 4), or by both the agency and the vendor
(Phase 5). The impact of the benchmark library concept on
each of these costs is discussed in Section 3.3.
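
As a rough illustration of the cost attribution just described, the five phases can be tabulated by the party that bears each cost. The Python sketch below is hypothetical: the names PHASES and phases_borne_by are illustrative only and do not appear in the report.

```python
# Hypothetical tabulation of the five benchmarking phases described above.
# Each entry: (phase number, activity, parties that incur the cost).
PHASES = [
    (1, "define workload", {"agency"}),
    (2, "construct benchmark", {"agency"}),
    (3, "test and document benchmark", {"agency"}),
    (4, "convert benchmark and configure system", {"vendor"}),
    (5, "timed live test demonstration", {"agency", "vendor"}),
]


def phases_borne_by(party):
    """Return the phase numbers in which the given party incurs a cost."""
    return [num for num, _activity, payers in PHASES if party in payers]


if __name__ == "__main__":
    print("agency:", phases_borne_by("agency"))
    print("vendor:", phases_borne_by("vendor"))
```

Laid out this way, it is apparent that the agency carries the cost of four of the five phases, which is where a shared library of programs would be expected to have the most effect.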

Although benchmarking is an important sizing tool, it
is not an exact one. Benchmark runs are approximations to
true workload demands over some agency-determined time
frame. The degree to which a benchmark is representative of
the true workload depends upon the complexity of the real
workload, the accuracy with which future workload demands
can be predicted, and the amount of effort the agency
invests in the workload definition and benchmark
construction phases. Producing high-quality benchmarks is
usually a very expensive process for an agency. Low-quality
benchmarks, on the other hand, are less expensive to
produce, but usually result in higher costs to the vendors
(as in the case of poorly documented programs), in addition
to a higher risk that the procured system may not adequately
satisfy the agency's requirements. It is the need for
high-quality benchmarks at less cost to both the agencies
and vendors that has prompted various efforts to establish a
library of standard benchmark programs.



2. Previous Efforts

Several early efforts, notably those within the
Department of Defense, attempted to address the benchmark
library concept. Other related works include the use of
standard benchmark problems by the Auerbach Corporation, a
paper by Lucas in 1972 in which he outlined a set of modules
that could be used to construct a functional benchmark, and
a study by the Mitre Corporation in 1975 in which results of
a limited test of the benchmark library concept were
presented. More recently, the use of an internal set of
standard benchmark programs by the U.S. Department of
Agriculture in their own procurements appears to be the most
promising effort toward establishing feasibility. Each of
these activities is discussed in more detail below.



2.1. Department of Defense Efforts

a. Air Force efforts

In 1971, a study conducted for the U.S. Air Force by
the Mitre Corporation [11] resulted in a plan for a standard
benchmark library for use in the competitive selection of
computer systems by the Air Force Directorate of Automatic
Data Processing Equipment Selection (MCS). The study
included a feasibility study and an economic analysis of the
standard benchmark approach as it applied to Air Force
procurements. The study outlined the objectives and
operation of a benchmark library, and presented several
issues related to its use. Among the issues raised were:

1. Could vendor systems evolve in such a way that they
   would eventually be "tuned" to process the standard
   benchmark programs in a manner more efficient than
   the real workload?

2. What form should the benchmarks take (e.g., actual
   user programs vs. small CPU and I/O (synthetic)
   modules)?

3. Can users build representative workload models
   (i.e., benchmarks) using standard benchmark
   programs?

This last point was determined to be "the single most
important issue in consideration of an MCS standard
benchmark library." Because of this, it was suggested that
"a trial run of the use of library programs to specify
system workloads should be performed before the library
concept is fully implemented." The study also estimated the
level of staff and computer resources needed to implement
the library, in addition to the dollar savings to the Air
Force based on its use. Because the investment decision
would "just about break even" (i.e., costs would equal
benefits), it was concluded that the decision whether to
implement the library should be based on non-monetary
benefits, such as reduced time to complete a procurement and
reduced vendor costs. However, the study added that the
most critical problem was whether user workloads could be
represented by benchmark programs chosen from a standard
library, and that this question could only be answered
through experience. The study called for an early review of
feasibility and a test run of the library as soon as it
became operational. Apparently no further work was
undertaken on this effort.

b. Army efforts

The development of standard benchmarks within the
Department of the Army began in September 1972 in response
to recommendations made by a Department of Defense (DOD)
task force investigating the time and cost of ADPE
procurements. This development effort became a full-time
project within the U.S. Army Computer Systems Support and
Evaluation Command (USACSSEC), although the project was
coordinated by a joint steering committee composed of
members from the Army, Navy, Air Force, and Defense Supply
Agency. Initial efforts centered on the development of
functional synthetic benchmark programs, data files to be
used by the synthetic programs, and the development of a set
of procedures for the use, distribution, and maintenance of
the programs.

A Contributor's Symposium on Standard Benchmarks was
held at USACSSEC in October 1972 for the purpose of refining
the standard benchmark concept. Participants at the
symposium included representatives from ADPE vendors, the
(then) Business Equipment Manufacturer's Association (BEMA),
interested universities, ADPE research firms, and the joint
steering committee. The following excerpt from Department
of the Army Pamphlet No. 18-10-2 [1] summarizes the results
of this meeting:

    "The symposium was keyed to the 'utility' of
    standard benchmarks, using the Steering
    Committee's concept as a 'strawman.' The symposium
    was successful in meeting the established goals
    and in familiarizing many of the potential users
    with this concept."

The USACSSEC effort resulted in a contract with Galler
Associates to "define a 'standard benchmark' and its usage."
Although the Galler contract culminated in an extensive
report [4] describing a "kernel" approach to the standard
benchmark concept, the USACSSEC nevertheless felt that there
were still several unanswered questions and unresolved
problems. Among these was the problem of mapping user
workload requirements into the proper set of "kernels." This
appears to be the extent of the USACSSEC efforts.

c. Navy efforts

A related effort was begun in June 1973 within the
Department of the Navy's ADPE Selection Office (ADPESO).
This effort, partly in support of the DOD effort and partly
for in-house use, was directed toward developing a small set
of synthetic programs which could be used to "enhance an
existing set of natural benchmarks in order to gauge
specific system characteristics" [2]. Although the Navy
effort produced five synthetic benchmark programs in which
parameters could be set to force a prescribed load on
various system resources (e.g., the CPU, I/O), several
difficulties were reported. Among them were the dependency
of the parameters on the nature of the system being
evaluated, and the "sheer magnitude of the number of
combinations of program parameter values" [13]. The study
concluded that although synthetic programs could be
controlled to produce a prescribed processing load on a
given system, it was not possible "to arrive at a
generalized, comprehensive, and accurate model of system
workloads except in the most trivial cases." It added,
however, that "if one accepts a 'modest' workload
characterization, aimed more at reflecting extremities and
crucial areas, rather than comprehensiveness, it is possible
and reasonable to construct a benchmark from a set of
synthetic modules." No further work has been reported on
this effort.
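
The "sheer magnitude" difficulty can be made concrete with a small count. The Python sketch below is illustrative only: the function names and the parameter counts used in the note that follows are assumptions, not figures from the Navy study.

```python
from itertools import product


def count_configurations(settings_per_parameter):
    """Number of distinct parameter combinations for one program.

    settings_per_parameter: iterable giving, for each adjustable
    parameter, how many settings it can take.
    """
    total = 1
    for n in settings_per_parameter:
        total *= n
    return total


def enumerate_configurations(parameter_values):
    """Yield every combination of values, one value per parameter.

    parameter_values: dict mapping a (hypothetical) parameter name to
    the list of values it may take.
    """
    names = sorted(parameter_values)
    for combo in product(*(parameter_values[name] for name in names)):
        yield dict(zip(names, combo))
```

Under the assumption of four parameters with ten settings each, count_configurations([10, 10, 10, 10]) gives 10,000 configurations for a single program, and a mix of five such programs has 10,000 to the fifth power, or 10^20, joint configurations. Exhaustive calibration is therefore out of reach even for modest parameter sets.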



2.2. Auerbach Standard Benchmarks

Perhaps the earliest reported use of standard
benchmarks was by the Auerbach Corporation in the
development of their Standard EDP Reports [6]. These
standard "benchmarks" were actually problems that covered a
number of commonly performed ADP functions, such as file
updating. The problems were hand-coded in assembly language
for each vendor's system. Published instruction times were
then used to calculate stand-alone problem time. A number
of standard equipment configurations were defined to make
comparative vendor evaluations easier. Execution times were
estimated for each problem on each configuration. Users had
to relate their special needs to these standard problems,
and, because they were coded in assembly language, the
problems were written differently for each vendor's system.
The problems were not actually run on vendor systems, and
the estimated execution times did not consider
multi-programming effects. This approach has apparently not
been used since approximately 197J.
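
The calculation of stand-alone problem time from published instruction times can be sketched as a weighted sum. The Python fragment below is a minimal illustration under stated assumptions: the function name, instruction names, counts, and timings are all invented for the example and are not taken from the Auerbach reports.

```python
def standalone_time_us(instruction_counts, instruction_times_us):
    """Estimate stand-alone problem time in microseconds.

    Sums (count * published time) over every instruction type used by
    the hand-coded problem. Raises KeyError if a timing is missing for
    some instruction type.
    """
    return sum(count * instruction_times_us[op]
               for op, count in instruction_counts.items())


# Hypothetical instruction mix for one file-updating problem.
counts = {"load": 1000, "add": 500, "store": 1000}
# Hypothetical published instruction times, in microseconds.
times_us = {"load": 2.0, "add": 1.5, "store": 2.0}

estimate = standalone_time_us(counts, times_us)
```

As the text notes, an estimate of this kind reflects a single program running alone and says nothing about multi-programming effects.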

2.3. Lucas Modules

In a 1972 article [9], H. Lucas suggested that "a set
of industry-wide synthetic modules be developed and provided
by each computer manufacturer for his equipment." The
intended use of these modules was primarily to assist users
in modeling their workload (i.e., constructing a benchmark)
for use in the selection process.

The proposed synthetic modules were divided into three
categories: compiler attributes, operating system
attributes, and program execution. Both the compiler and
operating system categories contained modules primarily
concerned with evaluating error detection features. The
program execution category attempted to "represent all of
the common operations found in both commercial and
scientific data processing." Examples of such execution
modules are: fixed point operations, stress analysis,
forecasting model, and fixed length record update. Each of
these proposed modules had associated with it one or more
adjustable, but very general, parameters. Sample parameters
included: number of calculations and precision, size of
problem, number of forecasts and number of periods, and
number and size of fields updated.

Although Lucas suggested that a user could construct a
benchmark by selecting a group of synthetic modules from
such a collection, he did not specifically address the
problem of how this mapping from user requirements into
synthetic modules and parameter settings should be done. He
simply states that "the evaluator must determine the
anticipated job load for the system to be evaluated" and
that "he then selects representative models (i.e., synthetic
modules) and joins them together into jobs which model that
load."
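
The idea of selecting parameterized modules and joining them into jobs can be sketched as a small data structure. The Python example below is hypothetical throughout: the class names, the parameter names, and the pairing of parameters with modules (assumed to follow the order in which the text lists them) are illustrative, and the sketch does not solve the mapping problem that Lucas left open.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SyntheticModule:
    """A library module and the names of its adjustable parameters."""
    name: str
    parameters: tuple


@dataclass
class Job:
    """An ordered list of (module name, parameter settings) steps."""
    steps: list = field(default_factory=list)

    def add(self, module, **settings):
        # Require exactly the parameters the module declares.
        if set(settings) != set(module.parameters):
            raise ValueError(
                f"{module.name} expects parameters {module.parameters}"
            )
        self.steps.append((module.name, dict(settings)))
        return self


# Module and parameter names paraphrase the examples in the text.
FIXED_POINT = SyntheticModule("fixed point operations",
                              ("calculations", "precision"))
RECORD_UPDATE = SyntheticModule("fixed length record update",
                                ("fields_updated", "field_size"))

# The settings are placeholders; choosing them is the unsolved mapping step.
job = (Job()
       .add(FIXED_POINT, calculations=10_000, precision=2)
       .add(RECORD_UPDATE, fields_updated=5, field_size=8))
```

The structure makes the gap visible: assembling a job is mechanical, but nothing in it says which modules or parameter values represent a given workload.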



2.4. Mitre Study

A study conducted by the Mitre Corporation in 1975 [8]
for NBS stated three primary objectives: "to develop the
Application Benchmark Library concept, to perform a
preliminary feasibility test of this concept and to identify
related areas for further study." The "development of the
concept" consisted of a suggested approach concerning the
structure, creation, use, maintenance, and documentation of
an application library. The "preliminary test" consisted of
a controlled testing of two parameterized application
programs, one written in FORTRAN and the other in COBOL.
"Areas for further study" included investigations into the
"operational" and "economic" feasibility of the library
concept. One of the suggested "operational feasibility"
tests included testing the ability to map user programs into
library programs. In summary, the Mitre report suggested a
physical structure for the library, outlined a library
maintenance procedure, and showed that the resource demands
of parameterized programs could be controlled in a
predictable manner.



2.5. Department of Agriculture Experience

In 1972-1973, as part of its procurement of a new
system for which few operational programs existed, the U.S.
Department of Agriculture (USDA) undertook to develop a set
of functional synthetic benchmark programs. Although the
procurement was subsequently consolidated with other
procurements, the same benchmark programs, with revised
workload estimates, were used for this consolidated
procurement. Three vendors submitted proposals, and all
three demonstrated their proposed systems using the
synthetic benchmarks. The consolidated procurement was
cancelled, however, without an award being made. At the
present time, USDA is going forward with several new,
independent computer procurements. Each procurement
involves sizing the present and future workloads of a
different group of USDA agencies. The same basic set of
synthetic benchmark programs used in the original
consolidated procurement is being used for several of these
procurements [10]. However, the programs have been upgraded
in a number of ways since they were first developed. More
importantly, a standard procedure was developed by USDA for
its agencies to follow in projecting their workloads and
mapping them to the synthetic programs. The following
paragraphs discuss the USDA benchmark programs, the workload
mapping procedures, and various technical considerations and
issues related to the USDA effort.

a. Structure of the programs

Each of the USDA benchmark programs is designed to
perform some common data-manipulation function. Major
categories of functions are: (1) batch versus on-line
processing, (2) serial versus non-serial data accessing, and
(3) data retrieval versus data update operations. A
synthetic program was developed to represent each required
combination of these major categories (for example, "batch
serial update"). This effort resulted in a set of synthetic
programs which represent distinct ADP operations across many
applications, rather than programs which represent complete
applications (such as "payroll"). The synthetic programs
are inherently quite small and generate little CPU load
except for that associated with moving transactions and data
records in and out of memory. A common routine is
incorporated into each program, however, which can be set to
consume any amount of CPU time and any amount of memory.
All on-line synthetic programs are designed to execute in
conjunction with vendor-supplied transaction processing
software, which is expected to pass to the programs one
transaction at a time on a demand basis.
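
The three category pairs and the common resource-consuming routine can be sketched in a few lines. The Python example below is an assumption-laden illustration, not the USDA code: the names category_combinations and consume are hypothetical, CPU load is approximated by a loop count rather than measured time, and the report states only that programs were written for each "required" combination, not for all eight.

```python
from itertools import product

MODES = ("batch", "on-line")
ACCESS = ("serial", "non-serial")
OPERATIONS = ("retrieval", "update")


def category_combinations():
    """List every combination of the three major categories."""
    return [" ".join(combo) for combo in product(MODES, ACCESS, OPERATIONS)]


def consume(cpu_iterations, memory_bytes):
    """Stand-in for a routine that uses a set amount of CPU and memory.

    Allocates a buffer of the requested size and performs the requested
    number of loop iterations; returns (buffer size, checksum).
    """
    buffer = bytearray(memory_bytes)
    checksum = 0
    for i in range(cpu_iterations):
        checksum = (checksum + i) % 65521
    return len(buffer), checksum
```

Calling category_combinations() yields eight names, "batch serial update" among them; an agency would pick the subset its workload actually requires and tune each program's load through parameters such as those passed to consume.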

The synthetic programs are supported by a number of
other software and procedural components, which together
constitute a benchmarking system. These supporting components
include: a data generation program, a post-demonstration
analysis program, a workload mapping procedure, and a
workload tally program. Some of these components are
relevant to this report and are therefore discussed at
greater length in the following paragraphs.



b. Technical considerations

By virtue of its use in actual procurements, the USDA
benchmark system has had the benefit of several critical,
technical reviews. The more salient technical issues of the
USDA standard benchmark effort are discussed here.

First, it has been proven feasible to map the workloads
of a variety of USDA agencies to the benchmark programs.
This issue is discussed at greater length in the next
section. The USDA mapping effort did result in one or more
new synthetic programs, or variations of programs, being
proposed in order to more closely match certain major
workload functions. Each proposal for a new program was
evaluated to determine whether the resulting improvement in
representativeness would be sufficient to justify the cost
of developing the new program. On occasion, new programs
were deemed to be necessary.

There was considerable concern at the outset of the
USDA effort whether a vendor could take unfair advantage of
some inherent characteristic of the synthetic programs,
for example, by placing the entire executable portion of
code in a small, high-speed memory. The approach USDA took
in dealing with this issue was to attempt to identify each
potential weakness and correct it. A technical solution was
developed for each potential weakness that was identified.
USDA reports that no weaknesses have since been found which
could not be overcome.

One major problem which USDA faced was interfacing
their benchmark programs with sophisticated vendor software
for which standards did not exist. Although this issue is
not peculiar to synthetic programs, it is nevertheless
important enough to mention here. In particular, the USDA
benchmark depends upon a transaction processor and a
data-base management system. However, only the most
fundamental functions of these subsystems are used and even
then, the vendors are allowed to modify the program
interfaces. Although a more accurate workload
representation could be produced if segments of the
benchmark programs were tailored to the vendor software,
this was not deemed feasible for a number of reasons. One
major reason, presumably, was the desire to run the same,
unmodified programs on all vendor systems.

One potential weakness of standard benchmark programs,
referred to in Section 3.1 of this report, is the potential
for the programs to influence the evolution of vendor
systems. Nothing in the USDA experience can provide an
answer one way or another on this issue.

c. Workload mapping

Because the current series of USDA procurements involves
several different USDA agencies whose computer processing is
performed at various computer centers, each agency is
required to project its own future workload to be supported
by the new systems. Technical personnel supporting the
procurements, however, do provide the discipline to assure
the compatibility of format, in addition to combining the
workload projections for each center.

Early in its procurement efforts, USDA deemed it
necessary to use a standard procedure for mapping agency
workloads to the synthetic benchmark programs. Such a
procedure was developed and has since evolved as personnel
of several USDA agencies have used it. The workload mapping
procedure is incorporated into this report as Appendix B.
In summary, the procedure consists of four steps:



ERIC 6 . 



V ■ • ■ ■ • , 

1. Identify major agency functions that result in
ADP workload. Where practical, functions are
budget line items, such as "cotton loans."
Establish a discrete unit of activity measure for
each function (e.g., "number of loans").

2. Determine what ADP operations result from one
occurrence of each function. These ADP operations
are further quantified in terms of occurrences of
various synthetic benchmark programs, or other
specific benchmark workloads, such as program
compilations.

3. Project the units of activity for each major agency
function over the system life. Where practical,
this activity is performed by budget personnel or
other non-ADP persons.

4. Extend the quantifications of agency functions to
ADP operations; i.e., to synthetic programs and
other benchmark components. USDA has developed a
computer program to assist its agencies in
performing this step.
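
The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the USDA program mentioned in step 4: the function names, benchmark component names (SYNTH_UPDATE, SYNTH_REPORT, SYNTH_SORT), and all volumes are hypothetical, chosen so the multiplication is easy to follow.

```python
# Hypothetical sketch of the four-step workload mapping.
# Every name and number below is invented for illustration.

# Step 1: agency functions and their unit of activity measure.
functions = {
    "cotton_loans": "number of loans",
    "payroll": "number of checks",
}

# Step 2: occurrences of each benchmark component that result
# from ONE unit of activity of each function.
ops_per_unit = {
    "cotton_loans": {"SYNTH_UPDATE": 2.0, "SYNTH_REPORT": 0.5},
    "payroll": {"SYNTH_UPDATE": 1.0, "SYNTH_SORT": 0.25},
}

# Step 3: projected units of activity per year of system life.
projected_units = {
    "cotton_loans": {1980: 10000, 1981: 12000},
    "payroll": {1980: 50000, 1981: 52000},
}


def extend_workload(ops_per_unit, projected_units):
    """Step 4: extend projected activity to benchmark components.

    Returns {year: {component: total occurrences}}.
    """
    totals = {}
    for func, by_year in projected_units.items():
        for year, units in by_year.items():
            year_totals = totals.setdefault(year, {})
            for op, per_unit in ops_per_unit[func].items():
                year_totals[op] = year_totals.get(op, 0.0) + per_unit * units
    return totals


workload = extend_workload(ops_per_unit, projected_units)
for year in sorted(workload):
    print(year, workload[year])
```

The design choice here is to keep step 2 (per-unit mapping) separate from step 3 (volume projection), which mirrors the report's suggestion that the projections be made by budget or other non-ADP personnel while the mapping is made by people who know the ADP operations.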

Step 2 above appears to be the most tedious, and requires
that personnel have a thorough knowledge of their ADP
operations. These personnel must also be thoroughly
familiar with the synthetic benchmark programs. USDA
reports that approximately eight hours of tutoring are
required to familiarize personnel with these procedures.
Further discussions are sometimes necessary to clear up any
misunderstandings that may surface later. Nevertheless, it
is reported that agency personnel without prior knowledge
of the benchmarking system have performed the mapping
process effectively, and in several instances, with
relatively little training. This training procedure has
been the source of some changes to the synthetic programs,
since it is here that new people have the opportunity to
review the programs and surface deficiencies with respect to
the way the programs represent real ADP operations.

d. Effectiveness

The USDA benchmarking system appears to be satisfying
its three major objectives.

First, a single procedure and a single set of tools and
programs are serving to benchmark a series of systems.
Repetitive use of the same tools will certainly result in
much better calibration and much less cost to the Government
than would the development of a new benchmark for each
procurement. It is premature to claim similar cost savings
for the vendors, but it seems likely that their subsequent
benchmark costs using these programs will be reduced.



Second, in order to equalize their proposals, all
vendors are provided with the same demonstrable workload.
The fact that the original, albeit aborted, procurement
resulted in three demonstrated and proposed systems
indicates that this objective was achieved. The three
vendors who benchmarked in this early procurement effort did
not report any suspected biases in the synthetic programs.
In fact, a bias was claimed in one of the few operational
programs which were included in the benchmark. USDA reports
that recent analysis of vendor proposals and benchmark
results (which cannot be published for proprietary reasons)
indicates that the three responding bidders were as close in
their configurations as could be established by such
comparisons.

The third USDA objective was to assure that the systems
which are proposed have the proper capacity to perform the
projected workload. Strictly speaking, the only way to
prove that this objective is achieved is to track the
installed system's ability to meet the workload demands over
the system's life. This assumes of course that the workload
projections can be accurately made. As a practical matter,
there are a number of other ways that the confidence level
in the "correctness" of these benchmarks can be improved.
Steps which USDA has taken include simulation, analytical
analysis, and extensive execution of the benchmark on
multiple systems. Some of these efforts have led to a more
careful analysis of different elements of the benchmark and,
in certain instances, have resulted in various adjustments
to the benchmark programs themselves. In general, this
analysis has supported, to the extent possible, the validity
of the USDA benchmarks.






3. The Benchmark Library Study Group 



Because it was assumed that enough work had previously
been done to determine the feasibility of a standard
benchmark library, an NBS-sponsored study group was formed
in 1976 to address this question. As will be seen, this
assumption proved false, principally because there existed
neither within private industry nor within the Government
any accepted criteria for determining when a benchmark was
"representative" of a computer workload.



The study group consisted of personnel from the
Department of Agriculture, Bell Laboratories, and NBS. It
met several times between March 22, 1976 and October 13,
1976. The stated objective of the study group was to
"...attempt to establish the technical feasibility of
benchmark library concepts for use within the Federal
Government." In order to accomplish this objective, the
following tasks were established:



1. Define relevant terms. 

2. Determine scope of the benchmark library.

3. Identify potential problems associated with the
benchmark library concept, via interviews and a
detailed review of previous efforts.

4. Determine criteria against which a proposed
benchmark library can be evaluated for the purpose
of determining its acceptability. Although
evaluation criteria should be established for four
major areas (technical, management, cost, and
acceptability), emphasis was to be placed on the
technical aspect.

5. Apply the evaluation criteria established above to
existing or proposed benchmark library prototypes.

6. Based on the above results, determine, in general,
whether any benchmark library (existing or
proposed) is technically feasible (i.e., adequately
satisfies the established evaluation criteria).



Task 1 resulted in a glossary of terms (see Appendix A).
As a result of Task 2, the following scope was defined:

"The study will address the feasibility of
establishing and maintaining a library of
synthetic application programs which will be
useful for inclusion in benchmarks. More
specifically, it will be limited to programs with
these characteristics:

(a) They may be written in standard COBOL or
FORTRAN and must contain only standard
components of those languages;

(b) They are capable of representing batch or
on-line transaction-processing applications
primarily of a 'commercial' (vs.
'scientific') nature which are describable by
well-defined functions."

The results of Task 3 are described below. It soon became
apparent as a result of Task 4 that determining the
technical feasibility of a library of standard benchmark
programs required much more preliminary work than had
already been done. Section 4 of this report discusses this
problem in more detail, and Section 5 suggests future
courses of action.

Several issues evolved during the course of this study
relative to the implementation, maintenance, cost, and 
acceptability of a library of standard benchmark programs. 
The following paragraphs briefly discuss each of these 
issues and attempt to assess their impact on the overall 
feasibility of standard benchmark programs. 



3.1. Implementation Issues 

a. Identification of a set of ADP functions common 
to many agencies 

Central to the standard benchmark concept is the
assumption that there exists a reasonably small number of
ADP functions common to many agencies. Before a benchmark
library could be developed, it would thus be necessary to
first identify these functions. This could be accomplished
either by surveying large Government installations or by
reviewing the processing and benchmark requirements found in
recent computer system Requests for Proposals (RFP's).
Assuming such a set of functions exists and can be
identified, then benchmark programs could be written or
obtained to implement these ADP functions. It is this
collection of benchmark programs which would constitute the
benchmark library.

b. Ability of benchmark programs to accurately 
represent agency workloads 



Given that a set of common agency functions can be
identified, a related, but equally important, question
remains: Can the benchmark programs which implement these
functions be combined and parameterized in such a way as to
accurately represent agency workloads? For example, it may
be found that many agencies perform a particular type of
sort function. Although a benchmark program could be
written to duplicate this function, the question remains
whether the program can be parameterized to adequately
account for differing agency volumes, file structures, etc.
This problem is further complicated by the lack of an
accepted definition of what it means for a benchmark to be
"representative" of a workload.

c. Synthetic programs could produce "overwhelming side
effects"

A suggested alternative to the "functional" benchmark
programs as described above is the use of resource-oriented
synthetics. These synthetics are parameterized programs
which can be controlled to place a prescribed load on major
system resources. The resource-oriented synthetics perform
no useful work, but rather they exercise selected system
resources in some pre-defined manner, for example looping on
a series of CPU-bound statements. One of the problems that
has been raised relative to the use of resource-oriented
synthetics as standard benchmark programs is their inability
to represent a given workload's resource demands across
system lines [13]. For example, because they are usually
written in a higher-level language, the translation of
certain language constructs, such as a PERFORM statement in
COBOL or a DO statement in FORTRAN, may produce such
drastically different resource demands from system to
system that the synthetic's ability to represent the real
workload is destroyed.
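
As a rough illustration of what a resource-oriented synthetic looks like, the following Python sketch exposes two parameters, a CPU loop count and an I/O record count. It is not one of the programs studied in [13], which were written in COBOL or FORTRAN; the function names and parameters are hypothetical. The point made in the paragraph above applies directly: the same parameter values would translate into different actual resource consumption on different systems, because the loop and the write call are compiled and executed differently on each.

```python
import tempfile


def cpu_load(iterations):
    """Loop on a CPU-bound statement; the result is a throwaway checksum."""
    total = 0
    for i in range(iterations):
        total += (i * i) % 7
    return total


def io_load(records, record_size):
    """Write fixed-size records to a scratch file; return bytes written."""
    written = 0
    payload = b"x" * record_size
    with tempfile.TemporaryFile() as scratch:
        for _ in range(records):
            written += scratch.write(payload)
        scratch.flush()
    return written


def run_synthetic(cpu_iterations, io_records, record_size):
    """Place a prescribed (parameterized) load on CPU and I/O.

    The program performs no useful work; the returned values only
    confirm that the requested amount of activity took place.
    """
    return {
        "cpu_checksum": cpu_load(cpu_iterations),
        "bytes_written": io_load(io_records, record_size),
    }


if __name__ == "__main__":
    print(run_synthetic(cpu_iterations=100000, io_records=1000, record_size=80))
```

Note that the parameters prescribe a count of statements and records, not a measured quantity of CPU seconds or channel time, which is exactly why such a program may not carry a workload's resource demands across system lines.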

d. Unknown effects of optimizing compilers on
"stylized" synthetic programs

Another problem that has been raised relative to the
use of resource-oriented synthetic programs concerns the
unknown, uncontrolled effects of optimizing compilers.
Because they are highly "stylized" (i.e., artificial in
nature), such synthetic programs may be more (or less)
susceptible to the effects of optimizing compilers.
Consequently, the resulting performance impact on the
synthetic programs may not be typical of that which would
occur to the real workload. This problem also applies to
some extent to functional benchmark programs.



e. Possibility of inherent biases for or against some
vendors

A problem related to the use of any set of standard
benchmark programs concerns the possibility of inherent
biases for or against some vendors. Although a
benchmark should place a representative load on each
vendor's system, the benchmark should not perform actions
above and beyond those needed to represent the actual
workload. If it does, the benchmark may unduly bias one
vendor over another.

A suggested solution to this problem is the
incorporation of some mechanism, as part of the library's
normal maintenance procedures, for responding to and
resolving vendor complaints. Such actions may consist of
eliminating questionable programs from the library, or
modifying them to the satisfaction of all vendors.

f. Possible evolution of vendor systems tailored to
benchmark programs

Assuming that a library of benchmark programs is
usable, the question has been raised whether vendor systems
will evolve in such a way as to maximize the performance of
the benchmark programs, at the expense of the workloads
which will actually run on those systems. Some continuous
mechanism would therefore be needed, again as part of the
library's normal maintenance procedures, to monitor the
possible development of this situation.

g. Inability of synthetic programs to adequately test
compilers, operating system control features, etc.

Finally, because of the limited number of programs that
might be in a benchmark library, there is the danger that
such system functions as compiler diagnostic procedures,
operating system utilities, etc. would not be adequately
tested. However, as suggested by Lucas [9], standard
programs for testing these features could be developed.


3.2. Maintenance Issues. 

a. Ability of benchmark programs to meet
state-of-the-art changes

Because of the highly dynamic nature of computer
architectures and languages, a library of benchmark programs
would have to be adequately maintained in order to prevent
them from becoming obsolete. Obsolescence may result either
because the programs would simply no longer run, or because
they would be incapable of representing important, new
architectural features. This latter point is exemplified by
the recent popularity of paging systems: a benchmark
program not capable of representing the pattern of memory
references of a functional application could be biased
either in favor of or against some vendors. These potential
problems, of course, also apply to current benchmark
methods. In order to keep the benchmark programs consistent
with state-of-the-art architecture and language features, an
on-going review of the benchmark library programs would be
needed.


b. Mechanisms needed to resolve agency and vendor
problems and complaints

Irrespective of the particular benchmark programs in
the library, no set of programs will satisfy all agency
needs. Also, it is possible that a vendor may claim that
one or more of the programs is biased for or against a
particular system. Prompt resolution of these problems
requires a maintenance mechanism capable of extending the
library if enough agencies find it deficient in particular
functional areas, and of objectively testing vendor claims
of bias.



3.3. Cost/Benefit Evaluation

As input to an overall feasibility study of the
benchmark library concept, the cost of such a library, in
relation to its expected dollar benefits, should be
evaluated. If a library of standard benchmarks were
developed, agencies would have access to well-documented
programs, easily portable across vendor lines, with which to
construct or supplement their normal benchmark mix. This
would result in reduced time and cost to agencies in
constructing and documenting their benchmarks, as well as a
reduction in vendor conversion costs. In addition,
well-documented and tested benchmarks would most likely also
reduce the time to complete a live test demonstration, a
cost savings to both agencies and vendors. In a full
cost/benefit evaluation, these benefits should be weighed
against the cost to develop, use, and maintain a library of
standard benchmarks. The benchmark study group did not
conduct such a cost/benefit analysis other than to identify
the above factors.






3.4. Acceptability to Agencies and Vendors

As part of a general feasibility study on the benchmark
library concept, the anticipated use of the programs by
agencies would have to be evaluated. This could be
accomplished, as an example, by offering a preliminary set
to a number of agencies conducting procurements and
evaluating their use of the benchmark programs. It should
be pointed out that several procurements have already taken
place in which agencies have used pre-existing benchmark
programs because they were available, well-documented, and
fairly representative in function.

In addition to evaluating agency acceptance, vendor
response to the standard benchmark concept should be
solicited. It is anticipated that some vendors will welcome
clean, well-documented programs as a way of reducing their
benchmarking costs. As stated in the Executive Branch
position paper on Recommendation D-14, "CBEMA's (Computer
and Business Equipment Manufacturers Association's) primary
concern is to insure that benchmarks take a form such
that they can be constructed to be representative of the
user's needs, to be consistently representative across
vendor equipment lines, and not to restrict the vendor's
ability and responsibility to configure his computer systems
for most efficient processing." The vendor community has in
the past cooperated with Federal efforts to arrive at better
benchmarking approaches (a good example of this is the joint
Government-Industry Remote Terminal Emulation Project [5]).
There is no reason to believe that vendors would not
cooperate in addressing the standard benchmark concept.



4. Problems Encountered in Attempting to Determine 
Technical Feasibility 

In attempting to answer only the technical feasibility
question (and not such other related questions as
cost/benefits, acceptability, etc.), the benchmark library
study group determined that a set of evaluation criteria
should be established. Using these criteria, a candidate
benchmark library could then be objectively evaluated as to
its technical acceptability. These criteria were to be
established apart from any particular benchmark library.

As a result of a concerted effort to establish such
evaluation criteria, it was soon determined that there was
no common agreement among the study group members (or for
that matter, within the ADP community as a whole) concerning
the meaning of "representativeness" as it applies to
benchmarks of existing workloads. Since the
representativeness question was central to the evaluation
criteria, this raised an obvious obstacle.

For discussion purposes, a theoretical approach was
developed for defining "representativeness." A series of
experimental tests (i.e., "evaluation criteria") were
proposed such that if a candidate benchmark library "passed"
those tests, then it would be deemed "technically
acceptable," at least as far as its "useability" and
"portability" are concerned (see Appendix A for a definition
of those terms). This process is outlined in Appendix C and
is an example of the type and complexity of evaluation
criteria which the study group envisioned. It was generally
agreed, however, that current benchmarking practices are not
subjected to this level of rigorous definition and that such
a degree of representativeness may not be achievable in
practice. This did point out the need, however, for an
empirical and acceptable test of representativeness.

Finally, in attempting to determine technical
feasibility, the question arose whether the standard
benchmark approach should be compared against existing
benchmark construction approaches or whether it should be
examined on its own merits. Since more traditional
approaches to benchmarking have themselves never come under
close, scientific scrutiny, it was believed that the
benchmark library concept should be evaluated relative to
existing practices.



5. Conclusions

Based on the previous findings, the benchmark library
study group concluded that although the standard benchmark
library concept has been used with apparent success within
particular agencies (e.g., USDA), there is not yet
sufficient data to establish the feasibility of such an
approach for Government-wide use. The continued use of such
an approach by USDA, however, and their post-installation
experience, will provide more useful data to help answer
some of the issues and problems raised earlier.
Furthermore, the use of USDA's benchmarks by other agencies
on an ad hoc basis will also provide valuable experiential
data to help further answer the feasibility question as it
applies across agency lines. To this end, NBS is currently
exploring with USDA the possibility of making the USDA
benchmark programs, along with a companion user's guide,
available to all Federal agencies. If this is done, the
benchmark material would be distributed through a central
source, such as the National Technical Information Service
(NTIS). Requests for the benchmark material could then be
monitored as an indicator of agency interest in the standard
benchmark concept.

As a result of the study group's review, it was
apparent that a technical foundation had not yet been
established for addressing several fundamental questions in
all phases of the benchmark process: workload definition,
benchmark construction, etc. It was also clear that the
best of known practices [12] are being used by only a
handful of agencies. Furthermore, in spite of the
relatively large number of Government procurements that have
been conducted thus far, surprisingly little data exists on
the relative effectiveness of alternative benchmark
approaches to properly size computer systems. Some specific
questions that the study group believes should be addressed
are:

1. What should be the objectives, constraints, and 
quality measures of a benchmark mix demonstration? 

2. Does there exist a common set of ADP functions 
across agencies? 

3. Can a benchmark program be parameterized in such a
way so as to accurately represent these logical
functions, as well as any agency-required data
volumes?

4. How can possible benchmark biases be identified and
eliminated?

5. What are the proper analysis techniques which
should be used to define a workload prior to
benchmark construction?

6. What is the proper definition of 
"representativeness" in the competitive selection ■ 
environment? 

In addition to answering the above questions, more of
an exchange of ideas and experiences is needed among
agencies who have conducted computer system procurements.
Furthermore, in keeping with the spirit of Recommendation
D-14, other approaches to reducing benchmarking costs should
also be explored. One example is the development of a
library of "tools" to assist agencies in the workload
analysis and benchmark preparation phases. It is believed
that only through an in-depth analysis of the problems and
costs associated with each phase of the benchmarking process
will efforts to reduce overall benchmarking costs attain
their maximum potential payoff.






References 



1. Department of the Army, "Development of Standard
Benchmarks," Management Information Systems,
Information Processing Systems Exchange Pamphlet
No. 18-10-2 (May 1973) 1-8.

2. Department of the Navy ADPE Selection Office,
"Review of Standard Benchmark Effort," internal
memorandum (July 31, 1973).

3. Executive Branch Position Paper, "Proposed
Executive Branch Position/Implementation for
Recommendation D-14 of the Report of the Commission
on Government Procurement" (March 21, 1974).

4. Galler Associates, "An Automated Synthetic Standard
Benchmark Technique," Technical Report A-5029,
Arlington, Virginia, undated.

5. General Services Administration, "Summary of the
NBS/GSA Public Workshop on Remote Terminal
Emulation," GSA/ADTS Report CS 76-2 (February).

6. Gosden, J. and R. Sisson, "Standardized
Comparisons of Computer Performance," Proceedings
of the IFIP Congress (North-Holland Co., 1963)
57-61.



7. Hillegass, J., "Standardized Benchmark Problems
Measure Computer Performance," Computers and
Automation (January 1966) 16-19.

8. Loring, P., "ADP System Procurement: Concept and
Feasibility of an Application Benchmark Library,"
Mitre Corporation, Technical Report No. 3013
(March 1975).

9. Lucas, H., "Synthetic Program Specifications for
Performance Evaluation," Proceedings of the
National Conference of the ACM, Vol. 2 (August
1972) 1041-1058.

10. McNeece, J. and R. Sobecki, "Functional Workload
Characterization," Proceedings of the 13th Meeting
of the Computer Performance Evaluation Users Group,
NBS Special Publication 500-18 (September 1977)
13-21.







11. Mitre Corporation, "Approach Plan for a Standard
Benchmark Library for Use in Computer System
Selection," unpublished report (December 15, 1971).

12. National Bureau of Standards, "Guidelines for
Benchmarking ADP Systems in the Competitive
Procurement Environment," FIPS PUB 42-1 (May 1977).

13. Oliver, P., et al., "An Experiment in the Use of
Synthetic Programs for System Benchmarking,"
Proceedings of the National Computer Conference
(1974) 431-438.

14. Report of the Commission on Government Procurement,
Recommendation D-14 (December 1972).






Appendix A

Glossary of Terms



ACCEPTABILITY - A desired combination of qualities of the
proposed benchmark library including its proven feasibility
(i.e., portability, maintainability, and useability), as
defined herein, which would lead ultimately to its use
throughout the Federal Government.



APPLICATION PROGRAM - A computer program which directly
contributes to the processing of end work, as opposed to
computer systems programs, language processors, and other
utility programs.



BATCH PROCESSING - A mode of computer processing which is
characterized by the concurrent availability to the computer
of a complete set of input data for a given job to be
processed, the execution of which is not controlled in
real-time (i.e., on-line) by a user. See Transaction
Processing.

BENCHMARK - A set of computer programs and associated data 
tailored to represent a particular workload, and used to 
test the capability of a computer to perform that workload 
within a predetermined limit.



BUSINESS DATA PROCESSING - A broad class of computer jobs
which perform administrative and logistics type functions,
and are characterized by heavy demands for data input and
output relative to the amount of computation performed. See
Scientific Computing.

EVALUATION CRITERIA - The set of measurement standards (to 
be) established as a part of this study as a basis for 
evaluating the degree to which proposed solutions satisfy 
real or potential technical deficiencies of a benchmark 
library. 



FEASIBILITY (of a benchmark library) - The technological
capability to establish and maintain a usable set of
synthetic benchmark programs that can be assembled and
adjusted to represent large classes of Federal computer
workloads. See Usable.






FUNCTIONALLY-DESCRIBABLE WORKLOAD - A computer workload
which can be characterized and quantified in terms of
well-defined and predictable processing functions. See
Resource-Oriented Workload.



LIBRARY (benchmark library) - A collection of synthetic
benchmark programs which have been tested and documented for 
general use by Government agencies in computer benchmarks. 
See Synthetic Benchmark Program. 

MAINTAINABLE - The requirement that a benchmark library be 
supported by systems to test and document additional library 
programs, to respond to deficiencies, and to update the
programs as a result of changing technology. 

MIX - A combination of different benchmark programs and data 
which together correctly represent the real workload. 

PORTABLE - A requirement of synthetic programs in the
benchmark library to represent a specified amount of work on
different computers without undue bias resulting from
differences among the computers and their systems software.
Also refers to the ability of benchmark programs to run on
different systems with little or no source-code changes.

QUANTIFY - With respect to a computer workload, the process 
of expressing the workload in numerical values. 

REPRESENT - The ability of a benchmark to impose the same 
demands on a computer system as the real jobs which will be
processed on that system during a given time frame. 



RESOURCE-ORIENTED WORKLOAD - A computer workload which is
characterized and quantified in terms of its consumption of
computer resources. See Functionally-Describable Workload.



SCIENTIFIC COMPUTING - A broad class of computer jobs which
involve extensive mathematical functions and are
characterized by heavy demands for computation relative to
the amount of data input and output performed. See Business
Data Processing.



SYNTHETIC BENCHMARK PROGRAM - A parameterized, functional
computer program designed to represent a particular class or
function of application programs for benchmarking purposes
only; the synthetic benchmark program serves no other
useful function.



TRANSACTION PROCESSING - A mode of computer processing in 
which data is available as a function of time, usually when 
the transactions result from an on-line user. See Batch. 



USABLE - The ability of the potential library of synthetic
benchmark programs to represent an applicable computer
workload. A necessary ingredient is an effective method of
analyzing and mapping the workload quantification to units
which are compatible with the synthetic program parameters.





Appendix B

USDA Workload Mapping Procedure

Preface



The following material has been extracted from the USDA
benchmark system documentation. It is not presented here as
a stand-alone procedure, since the complete documentation
and some tutoring would be required to follow the procedure.

1. Project Benchmark Workload

The benchmark workload is somewhat unique in its
objective to establish the processing capacity of the
system. That is a different objective than cost
justification, i.e., calculating the value of the system,
which is concerned with all work which the computer will do.
The benchmark will be based upon the projected workloads
during periods of maximum throughput, which tend to recur in
daily, weekly, monthly, or annual patterns. The activities
described below are necessary to quantify this workload.



(a) Identify quantifiable events which represent agency
functions. These functions must be major agency program or
administrative functions. The proper level of detail for
these functions is the highest one which can result in an
explicitly determinable set of ADP operations. A Commodity
Credit Corporation loan, for example, is not sufficient
detail, because there are many kinds of such loans,
requiring different processing. The output of this activity
will be a list, for each agency, of the agency workload
functions, and the specific events to be quantified for
each, i.e., applications processed or loans made.

(b) Identify and define benchmark ADP operations. A
benchmark ADP operation will be directly and explicitly
represented in the benchmark workload mix by a synthetic
program or some other workload category. Not all programs
in the library have to be included, and there are some
workload categories which cannot be represented by synthetic
programs. For example, there may be high volume ADP
applications which are too complex to represent in synthetic
programs. Other categories of work, such as compiling,
sorting, and data base query language operations, will use
vendor software exclusively. The output of this activity
will be a list, with descriptions, of the ADP operations
likely to constitute significant parts of the peak workloads
to be benchmarked. A single composite list will apply to
all agencies. It is possible that one or more of these
operations might prove to be insignificant when the peak
periods are finally identified and quantified, and might
then be omitted from the benchmark.


(c) The volume for each agency quantifiable event
identified in activity 1(a) must be projected over the
scheduled life of the computer system. Quantification for
each year is required for each item. More detailed
quantification is also necessary for workload items which
experience cyclical ups and downs of volume within a year.
If the same cycle is repeated annually, a single profile
giving the workload percentage occurring in each month will
suffice for all years. Still shorter cycles may be
expected, in particular, daily cycles for on-line workload.
A single profile of daily clientele arrival rates may be
provided for all those on-line functions triggered by the
public at distributed locations. The output of this
activity will be, first of all, a columnar chart with agency
quantifiable events (by code and name) down the left side
and workload across the top, as shown in the Workload
Projection Form, Figure 1. Second, more detail will be
provided, by hour of day, or other period, to show volume
cycles of shorter duration. The two kinds of projections
will make it possible to project workload for any particular
point within the scheduled system life.



(d) Determine, by analytical means, the relationships
between quantifiable events specified in activity 1(a)
above, and the benchmark ADP operations identified in
activity 1(b). These relationships must be mapped into a
matrix which lists the ADP operations on one axis and
quantifiable events on the other, as shown on the Workload
Mapping Form, Figure 2. Experience indicates that ADP
systems which support agency functions fall into three
categories for mapping, defined and treated as follows:



(1) There is a category of ADP systems which are
executed frequently, at least monthly, and workload is
a direct function of the quantifiable events. ADP
systems must be further divided into contiguous
subsystems; that is, where processing by a single
subsystem is performed without intervening gaps in
time. Identify as category 1, and list, for each



USDA QUANTIFIABLE BENCHMARK EVENTS
AND WORKLOAD PROJECTIONS

Agency ________________          Date ________________

Quantifiable Event |        Volume Per Year        |               Percent Per Month
Code    Name       | 1977 1978 1979 1980 1981 1982 | JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC

[Blank form rows omitted.]

Figure 1. Workload Projection Form



USDA ASSIGNMENT OF BENCHMARK TO ADP OPERATIONS

Agency ________________

[Blank form. Legible column headings: Code for Quantifiable Event; ADP System or Subsystem; ADP Representative and (remainder illegible).]

Figure 2. Workload Mapping Form



subsystem:

o Code assigned to quantifiable event from
list 1(a) above,
o ADP system/subsystem name,
o Name and phone number of ADP consultant,
o Category (i.e., "1") of system/subsystem,
o Displacement (time) in months from incidence
of event to processing,
o Under each benchmark ADP operation, the number
of executions per incidence of event, for the
[illegible] of the transaction.

(2) The second category of systems/subsystems is
those for which there is infrequent (quarterly,
semi-annual, or annual) ADP processing, and workload is
a direct function of quantifiable events. ADP support
systems must be further divided into subsystems by
processing frequency. Specify the same as category 1
above except identify as category 2, and use one of the
following frequency codes in lieu of displacement:

0 Processed annually at end of calendar year
1 Processed annually at end of fiscal year
2 Processed semi-annually
4 Processed quarterly



(3) The final category consists of
systems/subsystems for which workload is not a function
of a quantifiable event. Maximum flexibility is
provided for quantifying and mapping this workload,
using a combination of the Workload Projection and
Workload Mapping forms. Show category 3 for these
systems/subsystems. The displacement/frequency column
is not used in tallying the workload and may be used as
desired for its information content. The distribution
of workload will be derived from monthly percentages
provided on the Workload Projection Form. The best way
to learn how to quantify and map category 3 workload is
to understand how it will be tallied. For a given
month, the monthly percentage will be multiplied by the
appropriate annual workload projection. This product
will in turn be multiplied by each of the ADP operation
quantities for designated systems/subsystems to yield
workload for the month in question. Given the three
value fields to be multiplied together, the actual
quantities can be manipulated in a variety of ways to
produce the same results. As with category 1 and 2
systems/subsystems, category 3 line items on Workload
Mapping Forms are associated with workload projections
by using the same quantifiable event code.
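
The category 3 arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name, the operation names ("sort", "update"), and all figures are hypothetical and do not come from the USDA forms.

```python
def category3_workload(monthly_percent, annual_volume, executions_per_unit):
    """Workload per ADP operation for one month of category 3 work.

    monthly_percent     -- percentage of the annual workload falling in the month
    annual_volume       -- annual workload projection for the line item
    executions_per_unit -- {operation name: ADP operation quantity}
    """
    units_in_month = monthly_percent * annual_volume / 100
    return {op: units_in_month * qty for op, qty in executions_per_unit.items()}


# Hypothetical line item: 10% of 12,000 units fall in the month of interest.
workload = category3_workload(10, 12000, {"sort": 2, "update": 5})
print(workload)

# The same result is obtained if the three fields are scaled differently,
# e.g. 100% of a projection of 1,200 units.
same = category3_workload(100, 1200, {"sort": 2, "update": 5})
print(same == workload)
```

The second call illustrates the remark that the three multiplied fields can be manipulated in several ways without changing the tallied result.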


(e) Select peak workload months. The objective of this
activity is to identify the peak months of computer workload
for the combined agencies. This will be done by tallying
workload for each month from Workload Projection and Mapping
forms. Detailed methodology cannot be worked out in advance
because the complexity of the task depends upon all the data
collected in activities 1(a) through 1(d). If all ADP
operations peak at the same time, then the selection will be
obvious. More analysis will be required if disparate peaks
materialize. Management guidance must be obtained as to the
desired level of capability to support peak periods, in
order to determine how much flattening of peaks is
appropriate. The output of this activity will be the
identification of at least two representative peaks,
occurring in the first and final years. If the workload
changes between these years in volume or composition, in
other than approximately linear fashion, additional peaks
must be identified to represent the changes.


(f) Quantify peak periods. Using the data derived in
steps 1(c) and 1(d), calculate the aggregate number of
iterations of each benchmark operation, for all combined
agencies, required to perform the agency workload during
each of the peak periods. The output of this activity will
be a quantification table for each peak period, giving the
number of iterations for each of the benchmark ADP
operations.

(g) Determine benchmark transaction characteristics.
For the purpose of this discussion, a transaction is a coded
representation of an event which triggers one iteration of
one of the benchmark ADP operations discussed in paragraph
1(b) above. This definition will apply whether the ADP
operation is on-line or batch, the difference being whether
the transactions are presented to the system individually at
the times when the driving events occur, or collected into
batches for processing. This activity will require
determining the characteristics of the transactions likely
to be in the operational systems and assuring that these
characteristics are adequately represented in the benchmark
programs.

(h) Determine data storage needs and characteristics of
the data base. This activity will consist of determining
the size of the data base(s) to be stored in the object
computer system, and the characteristics of the major data
files. It will also require taking measures to assure that
the benchmark adequately represents these data
characteristics.

2. Analyze Workload

The purpose of this analysis is to translate the workload
projections into parameters for the benchmark. These
specific activities will be required:



(a) Derive synthetic program parameters. These include
the sizes of programs, rate of job execution, numbers of
statements executed in each program, number of copies of
each program, and transaction rate per copy.

(b) Develop data storage benchmark plan. The size of
the data base, number and sizes of files, and file
organizations must be decided.



(c) Associate programs, transactions, and data files.
Decide the ratio of matching data base records to
transactions for each transaction type.



(d) Derive data generation parameters. Attempt to
assign keys which will render the correct
transaction-to-data-base ratio, and at the same time yield
the proper data volumes.

3. Develop and Test Benchmark Materials

This is a group of activities extending over the total
duration of the benchmark effort, related in that they
require knowledge of the benchmark programs and use of
computers. Specific activities are:



(a) Construction of emulators. In order to test
synthetic programs on USDA computers, a set of software
emulators is required to perform the functions of the
transaction processor and data base management system. This
activity consists of constructing and/or modifying these
emulators for the current procurement and testing them.



(b) Retest all benchmark components. The activity
consists of generation of test transactions and data via the
data generator, and exercising all emulators, synthetic
application programs, and the post processor.

(c) Update synthetic programs in accordance with new
specifications.

(d) Modify data generator to produce transactions and
data files in accordance with new specifications.

(e) Generate new transactions and data. 

(f) Test benchmark and produce control values.

(g) Reproduce materials for vendors. Use a tape copy
process. Use each new copy to reproduce the next, finally
validating the last copy against the original.



Tally Process

A computerized process will tally the workload for any given
month in the scheduled system life, from data provided on
Workload Projection and Workload Mapping forms. The results
will be an aggregate volume for each ADP operation listed on
the mapping forms. Detailed steps for the tally, with a
year and month given as parameters, are:

1. Initialize a tally for each of the benchmark ADP
operations.

2. Process each agency quantifiable event sequentially
through steps 3 and 4.

3. Get the workload projection for that event and
hold.

4. Process each ADP subsystem for the function
according to which of the three categories it is in, e.g.,

(a) For category 1, subtract displacement (see
1.d.1) from the month given as parameter to obtain the
month of workload origin. If it falls earlier than
available data, add 12 months. Obtain the
quantification projection for the month of origin.
Multiply the number of executions of each ADP
operation by the quantification for the month of
origin and add to their respective tallies.

(b) The second category is periodic processing
with frequency codes of 0, 1, 2, or 4. The workload
for a given system/subsystem will be used only if part
of the processing is scheduled to fall in the month for
which the tally is being made. That can be determined
from Table 1, which shows an example of the allocation
for each frequency code to months. If the allocation
is non-zero for the object month, then the combined
workload for the months listed in the corresponding
"use data for" column of Table 1 is determined. That
is done by multiplying each month percentage by the
appropriate annual volume and summing the products.
The allocation for the object month is then applied to
the sum. This product is then multiplied by the number
of executions of each ADP operation and the products
are added to their respective tallies.

(c) Category 3 line items are treated as those in
category 1, except that displacement is assumed to be
0. See paragraph 1.d.3 for a discussion of the use of
category 3.

5. Print out the final tallies for each ADP operation.

Table 1. Distribution of Periodic Workload (Category 2)

Allocate  Frequency 0             Frequency 1             Frequency 2             Frequency 4
to:       Use Data    Allocation  Use Data    Allocation  Use Data    Allocation  Use Data    Allocation
          for:                    for:                    for:                    for:

Dec       Jan-Dec     10%                                 July-Dec    [illegible] Oct-Dec     5%
Jan                   50%                                             35%                     80%
Feb                   40%                                             10%                     15%
Mar                                                       Oct-Mar     [illegible] Jan-Mar     5%
Apr                                                                   35%                     80%
May                                                                   10%                     15%
June                                                      Jan-June    5%          Apr-June    5%
July                                                                  35%                     [illegible]
Aug                                                                   10%                     [illegible]
Sept                              Oct-Sept    10%         Apr-Sept    5%          July-Sept   5%
Oct                                           50%                     35%                     80%
Nov                                           40%                     10%                     15%
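
The tally steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the USDA program: the dictionary layout, the names `origin` and `tally`, and all sample figures are hypothetical. For category 2 the sketch applies the object year's annual volume to every "use data for" month, which is a simplification of "the appropriate annual volume."

```python
from collections import defaultdict


def origin(year, month, displacement, first_year):
    """Return (year, month) of workload origin; months run 1..12."""
    index = year * 12 + (month - 1) - displacement
    if index < first_year * 12:   # earlier than available data
        index += 12               # "add 12 months"
    return index // 12, index % 12 + 1


def tally(events, line_items, allocation, year, month):
    """Aggregate volume per ADP operation for one (year, month)."""
    totals = defaultdict(float)                    # step 1
    for item in line_items:                        # steps 2 and 4
        event = events[item["event"]]              # step 3
        volume, percent = event["volume"], event["percent"]
        if item["category"] == 2:
            entry = allocation[item["frequency"]].get(month)
            if entry is None:                      # no processing falls in this month
                continue
            use_months, share = entry
            # Simplification: every "use data for" month takes the object year's volume.
            combined = sum(percent[m] / 100.0 * volume[year] for m in use_months)
            quantity = combined * share
        else:                                      # categories 1 and 3
            displacement = item["displacement"] if item["category"] == 1 else 0
            y, m = origin(year, month, displacement, min(volume))
            quantity = percent[m] / 100.0 * volume[y]
        for operation, executions in item["operations"].items():
            totals[operation] += quantity * executions
    return dict(totals)                            # step 5 would print these


# Hypothetical event: 50% of its volume in December, 25% each in January and February.
percent = {m: 0.0 for m in range(1, 13)}
percent.update({12: 50.0, 1: 25.0, 2: 25.0})
events = {"E1": {"volume": {1977: 1200, 1978: 2400}, "percent": percent}}
line_items = [
    {"event": "E1", "category": 1, "displacement": 1, "operations": {"update": 3}},
    {"event": "E1", "category": 2, "frequency": 4, "operations": {"sort": 2}},
]
# Frequency 4, January row of Table 1: use Oct-Dec data, allocate 80%.
allocation = {4: {1: ((10, 11, 12), 0.80)}}

result = tally(events, line_items, allocation, 1978, 1)
print(result)
```

For the January 1978 tally, the category 1 item reaches back one month to December 1977, while the category 2 item contributes only because the allocation table has a non-zero entry for January.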







Sample Evaluation Criteria 



The following describes a proposed set of evaluation 
criteria to be used to determine the useability and 
portability of a candidate benchmark library. 



1. Useability
1.1. Background 



Recall the definition of "usable" (see Appendix A): 

USABLE - The ability of the potential library of
synthetic benchmark programs to represent an applicable
computer workload. A necessary ingredient is an
effective method of analyzing and mapping the workload
quantification to units which are compatible with the
synthetic program parameters.

Implicit in this definition are two necessary components of
the library: (1) a set of programs that can represent a
workload; and (2) a set of procedures that specify how to
use the library. Thus, any evaluation criteria testing
"useability" should test both of these capabilities.



Recall also the definition of "represent": 



REPRESENT - The ability of a benchmark to impose the
same demands on a computer system as the real jobs
which will be processed on that system during a given
time frame.

This requirement is summarized by the following diagram: 



[Diagram: the workload W and the benchmark B are each applied to system S, producing "workload demands on S" and "benchmark demands on S" respectively.]

That is, for any given system S, if the workload W and the
benchmark B are run on S, then W and B should produce
approximately the same demands on S.



The next question is, what do we mean by "the same
demands." The following three requirements define what it
means for "W and B to produce approximately the same demands
on S":

1. The elapsed running time of W on S should be
approximately the same (e.g., within 10%) as the
elapsed running time of B on S. Note, for on-line
applications, "elapsed time" could be replaced by
"response time."

2. The resource utilization data (e.g., percent 
CPU active, average disk space used, I/O volume 
transferred) when W is run on S should be approximately 
the same as the corresponding data when B is run on S. 

3. The resource profiles of W and B should be
approximately the same.

Items 1 and 2 seem obvious if one wants B to properly size
the right system. It is item 3, however, which requires
expanded discussion. In order to show the importance of
item 3, especially in a multi-programming environment,
assume the following situation:



1. Let two applications, A1 and A2, make up the
real workload W and have the following resource
profiles:

[Figure: CPU and I/O activity versus time for A1 and A2, with intervals X and Y marked on the time axis.]

That is, A1 uses X seconds of I/O followed by Y seconds
of CPU, and A2 uses X seconds of CPU followed by Y
seconds of I/O.



2. Assume that benchmarks B1 and B2, which are
claimed to represent A1 and A2 respectively, have the
following resource profiles:

[Figure: CPU and I/O activity versus time for B1 and B2.]



Note that: 

(a) both B1 and B2 have the same elapsed
times as the applications they each claim to
represent; and

(b) both B1 and B2 have the same resource
utilization data (i.e., CPU and I/O times) as the
applications they claim to represent. In
addition, note that B1 has the same profile as A1,
but B2 and A2 have different profiles.

3. Assume both applications are now run in a
multi-programming environment where the CPU and I/O can
overlap each other:

[Figure: CPU and I/O activity versus time for A1 and A2 run together.]

Note that the total workload completes in elapsed time
X+Y.

4. Assume both benchmarks B1 and B2, which claim
to represent A1 and A2, are now run in the same
multi-programming environment:

[Figure: CPU and I/O activity versus time for B1 and B2 run together, with a WAIT interval for B2.]

Because B2 had to WAIT for B1's I/O demands to
complete, the elapsed time to run the total benchmark
was extended to 2X+Y, nearly double that of the
workload which the benchmark claimed to represent.

The above example thus points out that it is not sufficient
for a benchmark to have the same elapsed time and resource
utilization data as the workload it claims to represent;
but rather, the benchmark should also have a resource
profile similar to that of the real workload -- especially
in a multi-programming environment.
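
The example can be reproduced with a small scheduling sketch. It assumes one CPU and one I/O channel, each serving a single job at a time, with ties going to the first-listed job. The function `makespan` and the values of X and Y are hypothetical. The profile used for B2 (Y seconds of I/O followed by X seconds of CPU) is an assumption chosen to be consistent with the stated 2X+Y result; the text itself says only that B2's profile differs from that of A2.

```python
def makespan(jobs):
    """Elapsed time to finish all jobs.

    jobs -- list of jobs, each a list of (resource, seconds) phases that must
            run in order; each resource serves one job at a time.
    """
    free_at = {}                    # resource -> time it becomes idle
    ready = [0.0] * len(jobs)       # job -> time its previous phase ended
    nxt = [0] * len(jobs)           # job -> index of its next phase
    while True:
        candidates = []
        for j, phases in enumerate(jobs):
            if nxt[j] < len(phases):
                resource, _ = phases[nxt[j]]
                start = max(ready[j], free_at.get(resource, 0.0))
                candidates.append((start, j))
        if not candidates:
            return max(ready, default=0.0)
        start, j = min(candidates)  # earliest start; ties go to the lower job index
        resource, seconds = jobs[j][nxt[j]]
        ready[j] = free_at[resource] = start + seconds
        nxt[j] += 1


X, Y = 3.0, 5.0                     # hypothetical durations in seconds
workload = [[("io", X), ("cpu", Y)],    # A1: I/O then CPU
            [("cpu", X), ("io", Y)]]    # A2: CPU then I/O
benchmark = [[("io", X), ("cpu", Y)],   # B1: same profile as A1
             [("io", Y), ("cpu", X)]]   # B2: assumed profile, I/O first

print(makespan(workload), makespan(benchmark))
```

With these durations the workload finishes in X+Y seconds and the benchmark in 2X+Y seconds, even though each benchmark program has the same stand-alone elapsed time and resource totals as the application it represents.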



1.2. Useability Evaluation Criteria

Based on the previous discussion, the following
evaluation criteria would thus determine whether a candidate
benchmark library is acceptable in terms of "useability":



Useability Criteria: A benchmark library is
"usable" if, given an arbitrary workload W, programs
from the library can be selected, configured (i.e.,
parameters properly set), and combined in such a way,
using established library procedures, so that the
collection of programs (i.e., the benchmark B) suitably
represents W. That is, for any arbitrary system S:

a) the elapsed time of W on S ≈ the elapsed time
of B on S;

b) the resource utilization of W on S ≈ the
resource utilization of B on S;

c) the resource profile of W on S ≈ the resource
profile of B on S.



1.3. Application of Useability Evaluation Criteria

Having defined the evaluation criteria which will
determine whether a candidate benchmark library is usable,
the next step is to define the procedure for applying the
criteria. This section will outline a sequence of steps to
be followed which will determine whether a candidate library
meets the Useability Criteria for a given workload on a
given system. Note, the ideal test of a library would be to
apply the Useability Criteria across all workloads and
across all systems. Because this would not be practical,
the procedure actually defines a set of necessary, but not
sufficient, conditions for useability.



Before the procedure which will determine useability
can be applied, the following preliminary steps should be
performed in order to obtain a test workload W:

1. Identify ADP functions {F1, ...} common to
many agencies, by:

(a) surveying agencies, e.g., distribute a
list of ADP functions (e.g., those identified by
Lucas [9]) and have agencies indicate the
frequency of use and importance of each;

(b) or, alternatively, identify those
functions believed to be used by many agencies and
see if this list is consistent with recent RFP's.

2. Select from an agency (or create) a set of
applications {A1,...,An} that perform the functions
identified in 1. These applications are real computer
programs that will make up the test workload W. Note,
each Ai could be composed of many programs.

Having constructed a test workload W, the following steps 
are performed to determine the "useability" of a candidate 
benchmark library. The following procedure is optimal in 
the sense that if a benchmark library will fail, it will 
fail early. 






Procedure to Determine Useability

1. Using the benchmark library procedures, create
a benchmark Bi (a set of library programs) to represent
each application Ai of W. That is, apply the library
procedures to choose the proper programs and parameter
settings. Call the collection of Bi's, B.

2. Run W and B single thread (i.e., not
multi-programmed) on several large systems and
calculate the errors in demands as follows:



Elapsed Times

a) Determine the elapsed times of each Bi and
its corresponding Ai. Note, it is necessary to
look at individual (Bi, Ai) differences and not
just total (B, W) elapsed time differences since
errors could have a cancelling effect, as
illustrated in the following elapsed time charts:

[Chart: elapsed-time bars for W (segments A1, A2, A3, A4) and for B (segments B1, B2, B3, B4); the totals are equal but the individual segments differ.]

Here, cumulative elapsed times for W and B are the
same, but individual ones are not.

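b) For each system on which W and B are run,
calculate the maximum elapsed time relative error:

System 1: $E_1 = \max\left( \frac{|A_1 - B_1|}{A_1}, \frac{|A_2 - B_2|}{A_2}, \ldots \right)$

System 2: $E_2 = \max\left( \frac{|A_1 - B_1|}{A_1}, \frac{|A_2 - B_2|}{A_2}, \ldots \right)$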



c) Find the maximum elapsed time error across
all systems:

$E = \max(E_1, E_2, \ldots)$

Thus, E represents the maximum percent error that
ever occurred between an application and its
corresponding benchmark. For example, if the
following represented elapsed running times in
minutes:

            A1    B1    |A1-B1|/A1    A2    B2    |A2-B2|/A2
System 1    10    15       .50        12     9       .25
System 2    14    15       .07        11    12       .09
System 3    13    13       0           9     8       .11

then E would equal .50, i.e., the maximum relative
error across all systems and (application,
benchmark) pairs.
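
As a check on the worked example, the calculation can be written out in Python. The elapsed times are those of the table above; the variable and function names are illustrative only.

```python
# Elapsed running times in minutes: (application, benchmark) pairs per system.
times = {
    "System 1": [(10, 15), (12, 9)],
    "System 2": [(14, 15), (11, 12)],
    "System 3": [(13, 13), (9, 8)],
}


def max_relative_error(pairs):
    """Largest |Ai - Bi| / Ai over the (Ai, Bi) pairs measured on one system."""
    return max(abs(a - b) / a for a, b in pairs)


per_system = {name: max_relative_error(pairs) for name, pairs in times.items()}
E = max(per_system.values())

for name, error in per_system.items():
    print(name, round(error, 2))
print("E =", E)
```

System 1 dominates because its benchmark B1 ran 5 minutes longer than the 10-minute application A1.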



Resource Utilization Data

a) For each major system resource Ri (e.g.,
R1 = CPU, R2 = core, R3 = disk space used, R4 =
channel activity, ...), collect appropriate
utilization data when W and B are run on each
system.



b) For each resource, calculate the resource
utilization errors between each application and
its corresponding benchmark; for example,

R1: avg. CPU
utilization        System 1        System 2        ...

    A1
    A2
    ...

where the entry for A2 on a given system is

$\frac{\text{avg. CPU for } A_2 - \text{avg. CPU for } B_2}{\text{avg. CPU for } A_2}$

c) Find the maximum resource errors across
all systems. Construct a resource utilization
error vector:

R = (max. CPU utilization error,
max. core utilization error,...)
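
A sketch of how the vector R might be assembled follows. The measurement layout, the function name, and every figure are hypothetical. Taking the maximum over absolute values of the relative errors is an assumption made here so that over- and under-estimates are treated alike.

```python
def utilization_error_vector(measurements):
    """Largest relative utilization error per resource, across all systems.

    measurements -- {resource: {system: [(application value, benchmark value), ...]}}
    """
    vector = {}
    for resource, by_system in measurements.items():
        vector[resource] = max(
            abs(a - b) / a
            for pairs in by_system.values()
            for a, b in pairs
        )
    return vector


# Hypothetical measurements for two applications on two systems.
measurements = {
    "cpu": {   # average percent CPU active
        "System 1": [(60, 57), (40, 44)],
        "System 2": [(50, 50), (80, 72)],
    },
    "core": {  # core used, in thousands of words
        "System 1": [(128, 128), (64, 80)],
        "System 2": [(100, 90), (64, 64)],
    },
}

R = utilization_error_vector(measurements)
print(R)
```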



Resource Profiles

a) For each major system resource, obtain a
profile across time of resource usage for each
application and its corresponding benchmark. For
example, on System 2 the CPU profiles for A3 and
B3 might look like:

[Figure: two plots of CPU Active (0 to 100%) versus time (t1, t2, t3, t4), one for A3 and one for B3.]





b) Apply statistical techniques to all
profiles for each resource and determine the
profile pairs least like each other. Quantify
this discrepancy in terms of relative error or
confidence limits.

c) Construct a profile error vector:

P = (max. CPU profile error,
max. core profile error,...)

In summary, the value E and the vectors R and P
thus tell, in quantifiable terms, how close (in
demands) B is to W.



3. Determine if B has passed the useability test
thus far. That is, see if E, R and P are within
acceptable bounds (e.g., E < 10%). If not, the
candidate benchmark library fails. If B passes so far,
continue with the next steps.

4. Construct a transaction processing test
workload W. Repeat steps 1-3. If pass, continue
below.

5. Try a combination batch and transaction
processing workload and repeat steps 1-3. If pass,
continue.

6. Try all of the above in a multi-programming
environment.
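
The acceptance check in step 3 can be sketched as below. The text gives only E < 10% as an example bound. Applying one common bound to every component of R and P is an assumption made for illustration, and the function name and sample values are hypothetical.

```python
def passes_useability(E, R, P, bound=0.10):
    """True if E and every component of R and P fall below the bound."""
    components = [E, *R.values(), *P.values()]
    return all(value < bound for value in components)


# Hypothetical error values for one candidate benchmark.
good = passes_useability(0.05, {"cpu": 0.02, "core": 0.07}, {"cpu": 0.08})
bad = passes_useability(0.50, {"cpu": 0.02, "core": 0.07}, {"cpu": 0.08})
print(good, bad)
```

Because the procedure is meant to fail early, a check of this kind would be applied after the batch test, after the transaction processing test, and again for the combined and multi-programmed runs.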



The above procedure will determine if a candidate benchmark
library can adequately represent existing application
programs. A further question is how close the benchmark
library can come to representing applications specified with
less and less knowledge -- i.e., closer to the functional
specification level.



2. Portability 
2.1. Background

Recall the definition of "portable": 



PORTABLE - A requirement of synthetic programs in the 
benchmark library to represent a specified amount of 
work on different computers without undue bias 
resulting from differences among the computers and 
their systems software. Also refers to the ability of 
benchmark programs to run on different systems with 
little or no source-code changes. 



Thus, the benchmarks constructed from the library must have
two necessary qualities: (1) they must contain standard
language and data constructs; and (2) they must not "unduly
bias" one system over another. It is clear what the first
criterion means. What is not clear is the meaning of
"unduly bias." The following discussion addresses this
latter point.

A benchmark should adequately represent a workload so
that the ability of one system to handle the workload better
than another system is reflected in the benchmark running
times, resource usage, etc. That is, the benchmark should
reflect the same "natural biasing" that will take place when
the real workload is run -- this, after all, is what
benchmarking is all about. The problem, of course, is that
the benchmark should not perform additional activities which
are not needed to represent the workload since these
additional activities are subject to different system
transformations and hence may skew the benchmark results.



How does one then determine if a benchmark is
performing these "additional activities" -- that is, if it
is unduly biased? One of the only practical ways is to
determine if the benchmark is placing more resource demands
on the system than the real workload would. The assumption
here, of course, is that if the benchmark were performing
"additional activities" they would be reflected in
additional demands. This assumption appears correct except
in the case in which additional (or insufficient) demands
cancel each other with the net effect that the benchmark has
similar aggregate demands as the workload, though different
activities. Furthermore, it is necessary to assume that the
application programs from which the benchmarks are
constructed are themselves unbiased.



2.2. Portability Evaluation Criteria

The following evaluation criteria would thus determine
whether a candidate benchmark library is acceptable in terms
of "portability":



Portability Criteria: A benchmark library is
"portable" if, given an arbitrary workload W, a
benchmark B can be constructed which:

a) contains only standard language and data
constructs; and

b) does not place additional demands on an
arbitrary system S beyond those that would occur if W
were actually run on S (i.e., does not unduly bias one
system over another).



2.3. Application of Portability Evaluation Criteria 

The procedure for applying the Portability Criteria to
a candidate benchmark library can, as it turns out, be
performed in parallel with the "useability" test described
earlier. Having constructed a benchmark B to represent a
test workload W, B could be examined either manually or
automatically to determine if it contains any non-standard
language constructs. Secondly, during the running of W and
B on various systems, the resource utilization and profile
data collected for the "useability" tests can also be used
to determine whether (and how much) B is unduly biased
(since, as has already been stated, unduly biased means B
places different demands on the system than does W). In
fact, the resource utilization error matrix developed
earlier will tell whether the benchmark is biased by
application (comparing the matrix rows) or by system
(comparing the matrix columns).
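
One possible way to compare the rows and columns of an error matrix is to average them, as in the sketch below. Using signed errors (application minus benchmark, divided by application) and simple means is an assumption made for illustration; the report does not prescribe a particular statistic. The matrix values are hypothetical.

```python
def bias_summary(matrix):
    """Mean signed error per application (rows) and per system (columns).

    matrix[i][j] -- signed relative error for application i on system j;
                    a negative value means the benchmark demanded more.
    """
    by_application = [sum(row) / len(row) for row in matrix]
    by_system = [sum(column) / len(column) for column in zip(*matrix)]
    return by_application, by_system


# Hypothetical CPU utilization errors: 2 applications (rows) x 2 systems (columns).
errors = [
    [0.25, 0.25],
    [0.0, -0.5],
]
by_application, by_system = bias_summary(errors)
print(by_application)  # row means: bias by application
print(by_system)       # column means: bias by system
```

A row mean far from zero would suggest the benchmark misrepresents that application on every system, while a column mean far from zero would suggest the benchmark favors or penalizes that particular system.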




